Minimizing the number of episodes and Gallai's theorem on intervals



Eva Czabarka
University of South Carolina, czabarka@math.sc.edu

Laszlo A. Szekely
University of South Carolina, szekely@math.sc.edu

Todd Vision
University of North Carolina, tjv@biol.unc.edu 

September 26, 2012 



Abstract 

In 1996, Guigo et al. [Mol. Phylogenet. Evol., 6 (1996), 189-203] posed
the following problem: for a given species tree and a number of gene trees,
what is the minimum number of duplication episodes, where several genes could
have undergone duplication together, needed to generate the observed situation?
(Gene order is neglected, but duplication of genes could have happened only on
certain segments that duplicated.) We study two versions of this problem, one of
which was algorithmically solved not long ago by Bansal and Eulenstein [1]. We
provide min-max theorems for both versions that generalize Gallai's archetypal
min-max theorem on intervals, allowing simplified proofs of the correctness of
the algorithms (as always happens with duality) and deeper understanding.
An interesting feature of our approach is that its recursive nature requires a
generality that bioinformaticians attempting to solve a particular problem usually
avoid.

1 Introduction



In 1996 Guigo et al. [7] posed the following problem: for a given species tree and a
number of gene trees, what is the minimum number of episodes of gene duplication,
where several genes could have duplicated in any single episode? The constraints of
the problem include some vertices of the gene trees identified as duplication vertices,
and duplication vertices have some associated intervals in the species tree, where
a duplication of the gene represented by the gene tree must have taken place. We
give mathematical definitions in Section 2 and explain the relevance of our results
in Section 5.

Several variants of this problem have been investigated: [6], [2], [2], [3], [10], [5].
Bansal and Eulenstein [1] solved a version of this long-standing open problem with
a greedy algorithm and proved the correctness of the algorithm by induction.

The purpose of our note is to put these problems and results into their proper
combinatorial context. There is no need to assume that the trees have no internal
vertices of degree two or that they are binary. The intervals associated with the
duplication vertices can be defined differently from the definitions in the biology
literature. The greedy algorithm still works; furthermore, simple min-max theorems
give a good characterization of the minimum number of episodes, even in this more
general setting. As usual, the duality allows for more transparent proofs of the
correctness of the optimization algorithms.

The min-max theorems are straightforward generalizations of Gallai's Theorem
on intervals (Gallai did not actually publish this theorem, and it was first printed in
a paper of Hajnal and Suranyi [9]):

Theorem 1.1 [Gallai] Let us be given a finite family of closed intervals on a straight
line. Denote by $\nu$ the size of the largest set of pairwise disjoint intervals, and by $\tau$
the smallest number of points that can cover all intervals, i.e. every interval contains
at least one of the points. Then $\nu = \tau$ holds.

As a reminder, we reproduce here the proof of Gallai's Theorem, as our proofs
were developed to generalize it. Clearly $\nu \le \tau$, as disjoint intervals have to be
covered by distinct points. We show that $\nu$ points suffice to cover all intervals.
Apply the following algorithm recursively until the interval system is empty:

Pick the leftmost right endpoint from all right endpoints of intervals from the
family, and delete all intervals from the system that this point covers. Add the
picked point to the list of selected points.

The points we picked are right endpoints of intervals that are pairwise disjoint by
the construction, therefore these endpoints are at most $\nu$ in number. These right
endpoints cover all our intervals by the construction. □
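
The greedy step of this proof can be sketched in Python as follows. This is an
illustrative sketch, not code from the literature cited here: the function name
`stab_intervals` and the representation of closed intervals as `(left, right)`
pairs are assumptions. It returns both the selected points and the pairwise
disjoint intervals whose right endpoints were picked, so the two lists always
have the same length.

```python
def stab_intervals(intervals):
    """Return (points, disjoint) for a finite family of closed intervals.

    points   -- right endpoints picked by the greedy rule; they cover the family
    disjoint -- the intervals whose right endpoints were picked (pairwise disjoint)
    """
    points, disjoint = [], []
    # Scanning by increasing right endpoint visits, among the intervals not yet
    # covered, the one with the leftmost right endpoint first.
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if points and left <= points[-1]:
            continue  # the interval already contains the last picked point
        points.append(right)
        disjoint.append((left, right))
    return points, disjoint


family = [(1, 3), (2, 5), (4, 6), (7, 8), (6, 9)]
points, disjoint = stab_intervals(family)
print(points)    # [3, 6, 8]
print(disjoint)  # [(1, 3), (4, 6), (7, 8)]
```

In this small example three points cover the family and three intervals are
pairwise disjoint, matching the equality stated in Theorem 1.1.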

Gallai's Theorem has been generalized by Suranyi (see [8]), with essentially the
same proof:

Theorem 1.2 [Suranyi] Let us be given a finite family of subtrees of a tree. Denote
by $\nu$ the size of the largest set of pairwise disjoint subtrees, and by $\tau$ the smallest
number of points that can cover all subtrees, i.e. every subtree contains at least one
of the points. Then $\nu = \tau$ holds.
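
One way to carry the interval greedy over to trees is sketched below. The text
only says that the proof is essentially the same, so the concrete rule used here
(root the tree anywhere and repeatedly pick the top vertex of an uncovered
subtree whose top is deepest) is our assumption, as are the names `depth` and
`cover_subtrees` and the encoding of the tree as a child-to-parent dictionary
with subtrees given as connected vertex sets.

```python
def depth(parent, v):
    """Number of edges from v up to the root (the vertex whose parent is None)."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d


def cover_subtrees(parent, subtrees):
    """Greedy cover of connected vertex sets (subtrees) of a rooted tree.

    Returns (points, packing): points are vertices meeting every subtree, and
    packing lists the pairwise disjoint subtrees whose top vertices were picked.
    """
    def top(subtree):
        # A connected vertex set has a unique vertex closest to the root.
        return min(subtree, key=lambda v: depth(parent, v))

    points, packing = [], []
    # Deepest top first: the tree analogue of "leftmost right endpoint".
    for subtree in sorted(subtrees, key=lambda s: depth(parent, top(s)), reverse=True):
        if any(p in subtree for p in points):
            continue  # already covered by an earlier pick
        points.append(top(subtree))
        packing.append(subtree)
    return points, packing


parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
subtrees = [{"r", "a"}, {"a", "c"}, {"c"}, {"r", "b"}, {"a", "d"}]
points, packing = cover_subtrees(parent, subtrees)
print(len(points), len(packing))  # 3 3
```

The picked subtrees are pairwise disjoint because two intersecting subtrees both
contain the path from a common vertex up to their tops, so the deeper top lies
in the other subtree, which would then have been skipped as covered.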

2 Describing the combinatorial problems 

Let us be given a finite set $X$. The elements of $X$ are called taxa. Let us be given a
tree $S$ with root $R$, such that the leaves of $S$ are labelled with elements from $X$ in
a one-to-one manner. Root $R$ is joined by an edge to $\infty$. We call $S$ the species tree
for the taxa in $X$. When we talk about vertices of $S$, we exclude $\infty$.

Let us be given $k$ gene trees, say for $i = 1, 2, \dots, k$ the gene tree $G_i$. The leaves
of $G_i$ are labelled with some taxa from $X$, but a taxon can occur in more than one
leaf, and not all taxa are necessarily represented by a leaf in $G_i$. A leaf corresponds
to one taxon only. We assume that $G_i$ also has a root $R_i$ and one more edge going
from the root to $\infty_i$; $\infty_i$ is not considered a vertex of the gene tree.

Vertices of $S$ have a natural partial order, namely $u \ge_S v$ if $u = v$ or $u$ separates
$v$ from $\infty$. When we speak about the interval of $u$ and $v$ in $S$, we mean the interval
in this natural partial order. We call $u$ the upper endpoint and $v$ the lower endpoint
of this interval. Vertices of $G_i$ have a natural partial order $\ge_i$ defined similarly.
$>_S$ and $>_i$ will refer to strict inequalities in these partial orders.

Assume further that every $G_i$ has a specified subset $D_i$ of its vertices, which are
called duplication vertices. For every $i$ and every $d \in D_i$, we have an associated
path $P$, which is a subpath of a path connecting $\infty$ to a leaf in the species tree
$S$. The ordered pair $(P, d)$ will be called the duplication interval associated to the
duplication vertex $d \in D_i$, and for more convenient notation we write it as $P_d$. In
this way we maintain names on the duplication intervals which tell which duplication
vertex of which gene tree generated the duplication interval. The same intervals
can have multiple names as duplication intervals: the same path $P$ in $S$ may be
assigned as a duplication interval to vertices in different gene trees, and also to
several pairs of $<_i$-comparable (or not $<_i$-comparable) duplication vertices of the
same $G_i$. Duplication intervals with different names are considered distinct objects
even if their underlying intervals in $S$ are the same.

The following monotonicity assumption is made on the associated duplication 
intervals: 

$$\forall i\ \forall d, e \in D_i:\quad d >_i e \;\rightarrow\; (\max P_d \ge_S \max P_e) \wedge (\min P_d \ge_S \min P_e). \qquad (2.1)$$
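
As a concrete reading of the monotonicity assumption, the sketch below checks it
for a small instance. Everything about the encoding is our assumption rather than
notation from the text: trees are child-to-parent dictionaries (the root maps to
`None`), each duplication vertex maps to an `(upper, lower)` pair of species-tree
vertices, and the names `is_above_or_equal` and `satisfies_monotonicity` are
hypothetical.

```python
def is_above_or_equal(parent, u, v):
    """True if u == v or u lies on the path from v to the root (u >= v)."""
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def satisfies_monotonicity(gene_parent, species_parent, intervals):
    """Check the monotonicity condition for every pair d > e of duplication vertices.

    intervals maps each duplication vertex to (upper endpoint, lower endpoint).
    """
    for d, (max_d, min_d) in intervals.items():
        for e, (max_e, min_e) in intervals.items():
            if d != e and is_above_or_equal(gene_parent, d, e):  # d > e
                if not (is_above_or_equal(species_parent, max_d, max_e)
                        and is_above_or_equal(species_parent, min_d, min_e)):
                    return False
    return True


species_parent = {"R": None, "A": "R", "x": "A", "y": "A", "z": "R"}
gene_parent = {"g1": None, "g2": "g1"}  # g1 > g2 in the gene tree
print(satisfies_monotonicity(gene_parent, species_parent,
                             {"g1": ("R", "A"), "g2": ("A", "x")}))  # True
print(satisfies_monotonicity(gene_parent, species_parent,
                             {"g1": ("A", "x"), "g2": ("R", "A")}))  # False
```

The second call fails because the higher duplication vertex `g1` is given an
interval whose upper endpoint lies below that of `g2`.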

We say that $P_e <_i P_d$ for two duplication intervals if $e <_i d$ for the duplication
vertices $e, d$ in $G_i$. A chain of duplication intervals is a sequence of duplication
intervals associated to duplication vertices $e_1 <_i e_2 <_i \dots <_i e_m$ for some
$i = 1, 2, \dots, k$.
(We freely change between speaking about chains of duplication intervals and chains
of duplication vertices in the gene trees, as they are in bijective correspondence.)
We may have several copies of a path in $S$ present as a duplication interval, and it
depends on the label of a particular copy whether it satisfies a $P_e <_i P_d$ type relation
or belongs to a certain chain. Some copy of a path may do so, while another may
not. Also, it may happen that a single path of $S$ satisfies a strict $P_e <_i P_d$ ordering
with proper duplication vertices $e, d \in D_i$.

We now consider the following models and optimization problems.

Discrete model. Let $V(S)$ denote the vertex set of the species tree $S$. Let $V^*(S)$
denote the extension of $V(S)$ by allowing an unlimited number of copies of the vertices.
We distinguish these copies from each other, but keep the information on which
vertices of $V^*(S)$ are copies of the same vertex of $V(S)$. The elements of $V^*(S)$ inherit
the $\ge_S$ partial order if they are copies of different vertices, and are incomparable
when they are copies of the same vertex. We denote this extended partial order by
$\ge_{S^*}$.

Consider maps $f : \bigcup_{i=1}^{k} D_i \to V^*(S)$, which have the property that for all $i$, the
restriction $f|_{D_i} : D_i \to f(D_i)$ preserves the partial order in the following sense:
$d >_i e$ implies that $f(d) >_{S^*} f(e)$ or $f(d)$ and $f(e)$ are different copies of the same
vertex. A value of $f$ is called a (duplication) episode.

Objective: Minimize the quantity $|f(\bigcup_{i=1}^{k} D_i)|$ over all maps $f$, and/or
characterize optimal solutions. (A fast algorithm for this minimization was discovered by
Bansal and Eulenstein [1].)
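
To make the discrete model concrete, the sketch below checks whether a candidate
map is admissible and counts its episodes. The encoding is our assumption: an
element of the extended vertex set is a pair `(vertex, copy)`, trees are
child-to-parent dictionaries, and intervals are `(upper, lower)` pairs. The
requirement that each episode lies inside the duplication interval of its vertex
is taken from the Introduction (the interval is where the duplication must have
taken place); the function names are hypothetical.

```python
def is_above_or_equal(parent, u, v):
    """True if u == v or u lies on the path from v to the root (u >= v)."""
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def is_feasible_discrete(gene_parent, species_parent, intervals, f):
    """Check a map f: duplication vertex -> (species vertex, copy number)."""
    for d, (upper, lower) in intervals.items():
        vertex, _copy = f[d]
        # Assumption: the episode of d must lie in the duplication interval of d.
        if not (is_above_or_equal(species_parent, upper, vertex)
                and is_above_or_equal(species_parent, vertex, lower)):
            return False
    for d in intervals:
        for e in intervals:
            if d != e and is_above_or_equal(gene_parent, d, e):  # d > e
                (vd, cd), (ve, ce) = f[d], f[e]
                strictly_above = vd != ve and is_above_or_equal(species_parent, vd, ve)
                different_copy = vd == ve and cd != ce
                if not (strictly_above or different_copy):
                    return False
    return True


def number_of_episodes(f):
    """Size of the image of f, the quantity to be minimized."""
    return len(set(f.values()))


species_parent = {"R": None, "A": "R", "x": "A", "y": "A", "z": "R"}
gene_parent = {"g1": None, "g2": "g1", "h1": None}  # two gene trees
intervals = {"g1": ("R", "A"), "g2": ("A", "A"), "h1": ("A", "A")}
good = {"g1": ("R", 1), "g2": ("A", 1), "h1": ("A", 1)}
bad = {"g1": ("A", 1), "g2": ("A", 1), "h1": ("A", 1)}
print(is_feasible_discrete(gene_parent, species_parent, intervals, good))  # True
print(number_of_episodes(good))                                            # 2
print(is_feasible_discrete(gene_parent, species_parent, intervals, bad))   # False
```

In `bad`, the comparable vertices `g1` and `g2` share the same copy of `A`, which
the order-preservation condition forbids.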

We consider an alternative model as well: 

Continuous model. Consider the edges of $S$ as line segments that have interior
points. Let $\mathrm{int}(S)$ denote the union of the sets of interior points of all edges of $S$.
The partial order $\ge_S$ on $S$ naturally extends to $\bar{S} = V(S) \cup \mathrm{int}(S)$. The
extension will be denoted by $\ge_{\bar{S}}$.

Consider now maps $f : \bigcup_{i=1}^{k} D_i \to \bar{S}$, which have the property that for all $i$, the
restriction $f|_{D_i} : D_i \to f(D_i)$ strictly preserves the partial order [i.e. $d >_i e$
implies $f(d) >_{\bar{S}} f(e)$]. We still call the values of $f$ (duplication) episodes.

Objective: Minimize the quantity $|f(\bigcup_{i=1}^{k} D_i)|$ over all maps $f$, and/or
characterize optimal solutions.

We say that a duplication interval in the species tree is degenerate if it has only
one point. For the continuous model we require, in addition, that the duplication
intervals are non-degenerate, as otherwise the problem may not have a feasible solution
at all. Note that the minimum number of episodes in the discrete and continuous
models can be different, even if all duplication intervals are non-degenerate.

3 The new min-max theorems 

First we discuss the simpler discrete model. Let $\Lambda$ be an arbitrary index set. For
$\lambda \in \Lambda$, let $C_\lambda$ be a set of duplication intervals that make a chain with respect to
one of the gene orders. We call $\{C_\lambda : \lambda \in \Lambda\}$ a disjoint chain packing if for every
$\lambda \ne \lambda'$, elements of $C_\lambda$ and $C_{\lambda'}$ do not share vertices in $S$, i.e.
$(\bigcup C_\lambda) \cap (\bigcup C_{\lambda'}) = \emptyset$.
We call $\sum_{\lambda \in \Lambda} |C_\lambda|$ the value of the disjoint chain packing. Fix now an arbitrary
disjoint chain packing $\{C_\lambda : \lambda \in \Lambda\}$. The number of episodes needed is clearly
at least the value of this chain packing, as different members of a chain
must belong to different episodes and vertex disjoint chains must use disjoint sets of
episodes.

Theorem 3.1 In the discrete model, the minimum number of duplication episodes
equals the maximum value of a disjoint chain packing.

We will prove the other (non-trivial) inequality in the next section. 
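
The easy direction can be made tangible with a small checker that verifies a
proposed disjoint chain packing and returns its value, which is then a lower
bound on the number of episodes. The sketch assumes our own encoding
(child-to-parent dictionaries, intervals as `(upper, lower)` pairs with `upper`
above or equal to `lower`, chains as lists of duplication vertices); the function
names are hypothetical.

```python
def is_above_or_equal(parent, u, v):
    """True if u == v or u lies on the path from v to the root (u >= v)."""
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def interval_vertices(species_parent, upper, lower):
    """Vertex set of the path from lower up to upper in the species tree."""
    vertices, v = {lower}, lower
    while v != upper:
        v = species_parent[v]
        vertices.add(v)
    return vertices


def disjoint_packing_value(gene_parent, species_parent, intervals, chains):
    """Sum of chain lengths if `chains` is a disjoint chain packing, else None."""
    covered = []
    for chain in chains:
        # Every two members of a chain must be comparable in their gene tree.
        if not all(is_above_or_equal(gene_parent, a, b)
                   or is_above_or_equal(gene_parent, b, a)
                   for a in chain for b in chain):
            return None
        vertices = set()
        for d in chain:
            vertices |= interval_vertices(species_parent, *intervals[d])
        covered.append(vertices)
    # Different chains must not share any vertex of the species tree.
    for i, first in enumerate(covered):
        for second in covered[i + 1:]:
            if first & second:
                return None
    return sum(len(set(chain)) for chain in chains)


species_parent = {"R": None, "A": "R", "x": "A", "y": "A", "z": "R"}
gene_parent = {"g1": None, "g2": "g1", "h1": None, "h2": None}
intervals = {"g1": ("R", "A"), "g2": ("A", "A"), "h1": ("A", "A"), "h2": ("z", "z")}
print(disjoint_packing_value(gene_parent, species_parent, intervals,
                             [["g2", "g1"], ["h2"]]))  # 3
print(disjoint_packing_value(gene_parent, species_parent, intervals,
                             [["g2", "g1"], ["h1"]]))  # None
```

The second packing is rejected because both chains use the vertex `A`.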

We continue with the continuous model. For every $\lambda \in \Lambda$, let $C_\lambda$ be a set of
duplication intervals that make a chain with respect to one of the gene orders. We
call $\{C_\lambda : \lambda \in \Lambda\}$ an almost disjoint chain packing if the following restrictions for
intersections (in $S$) of elements from different chains, $C_\lambda$ and $C_{\lambda'}$, hold:

(i) for any $U \in C_\lambda$ and $U' \in C_{\lambda'}$, we have $|U \cap U'| \le 1$;

(ii) for any $U \in C_\lambda$ and $U' \in C_{\lambda'}$, $|U \cap U'| = 1$ implies that the single element of
$U \cap U'$ is the $\ge_S$ (upper) endpoint of at least one of $U$ and $U'$;

(iii) if $U \in C_\lambda$ and $U' \in C_{\lambda'}$ intersect in a single point that is the $\ge_S$ endpoint of
$U$, but not the $\ge_S$ endpoint of $U'$, then there is an $R \in C_{\lambda'}$ such that $U' > R$ in
the gene order of the chain $C_{\lambda'}$, $U \cap R = U \cap U'$, and this common intersection
point is the $\ge_S$ (upper) endpoint of $R$ as well.

Of course, duplication intervals from the same chain are allowed to intersect. 

Note that condition (iii) means that different chains from an almost disjoint chain
packing, as sets in $S$, may only intersect at nodes of $S$, and if $v$ is a node where several
chains intersect, then it is the $\ge_S$-upper endpoint of all the chains that go through it
with at most one exception. The chains that go through $v$ all go down along different
edges from $v$, and the exceptional chain must contain an interval that has $v$ as its
$\ge_S$-upper endpoint.

For $v \in V(S)$ and an almost disjoint chain packing $\{C_\lambda : \lambda \in \Lambda\}$, let $\ell_\Lambda(v)$
denote the number of chains $C_\lambda$ that have elements with upper endpoint $v$. Fix
an arbitrary almost disjoint chain packing $\{C_\lambda : \lambda \in \Lambda\}$. We call

$$\sum_{\lambda \in \Lambda} |C_\lambda| \;-\; \sum_{v \in V(S),\ \ell_\Lambda(v) \ge 1} \bigl(\ell_\Lambda(v) - 1\bigr)$$

the value of the almost disjoint chain packing. The number of duplication
episodes needed is clearly at least the value of the almost disjoint chain packing, as
different members of any chain must belong to different episodes, disjoint intervals
also must belong to different episodes, and for any vertex $v$ with $\ell_\Lambda(v) > 1$, we may
use $v$ as the episode for (no more than) one of the intervals from each of the $\ell_\Lambda(v)$
chains covering $v$. Thus, by using $v$ as an episode, we can save $\ell_\Lambda(v) - 1$ episodes on
the duplication intervals containing $v$, compared to not using $v$ as an episode.
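
The value just defined is easy to compute once the chains are given. The sketch
below does only that arithmetic: it takes the total number of intervals in the
chains and subtracts, for every vertex that is the upper endpoint of intervals in
several chains, the number of those chains minus one. It does not verify
conditions (i)-(iii), and the input encoding (chains as lists of duplication
vertices, plus a dictionary giving the upper endpoint of each interval) and the
name `almost_disjoint_value` are our assumptions.

```python
from collections import Counter


def almost_disjoint_value(chains, upper_endpoint):
    """Value of a chain packing that is assumed to be almost disjoint.

    chains         -- list of chains, each a list of duplication vertices
    upper_endpoint -- maps a duplication vertex to the upper endpoint of its interval
    """
    total = sum(len(chain) for chain in chains)
    chains_at = Counter()  # vertex -> number of chains with an interval topped there
    for chain in chains:
        for v in {upper_endpoint[d] for d in chain}:
            chains_at[v] += 1  # each chain is counted once per vertex
    return total - sum(count - 1 for count in chains_at.values())


# Two chains hanging below the same vertex P, of lengths 2 and 1.
chains = [["a1", "a2"], ["b1"]]
upper_endpoint = {"a1": "P", "a2": "P", "b1": "P"}
print(almost_disjoint_value(chains, upper_endpoint))  # 2
```

Here three intervals need only two episodes, because one interval from each chain
can share the episode placed at `P` itself.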

Theorem 3.2 In the continuous model, the minimum number of duplication episodes
equals the maximum value of almost disjoint chain packings.

We will prove the other (non-trivial) inequality in the next section. 

It is easy to see that Gallai's Theorem 1.1 is a special instance of both Theorems
3.1 and 3.2 when the species tree is a path (only one taxon is present) and every
gene tree has a single duplication event.

4 Proofs 

As the algorithms and the proof of their correctness through the respective min-max 
theorems are very similar, we describe them in one text, and tell the differences as 
they arise. 

The proof is by mathematical induction on the total number of duplication vertices
in the gene trees. There is nothing to prove if none of the gene trees contains any
duplication vertex, and in this case the empty (almost) disjoint chain packing suffices.
The algorithm will remove the duplication designation of certain vertices in the gene
trees, but not the vertices themselves, and will recursively solve the reduced problem
with the reduced number of duplication vertices. We will also provide an (almost)
disjoint chain packing for the reduced problem, with the right value, such that the
min-max theorem holds by the inductive hypothesis for the reduced problem. Then,
case by case, we show that the number of episodes from the reduced problem plus
the number of episodes created by our greedy algorithm in the reduction step equals
the value of an (almost) disjoint chain packing for the original problem. This will
show simultaneously the optimality of our greedy algorithm and the truth of the
corresponding min-max theorem.

So we assume that we already know that the recursive algorithm solves the problem
optimally in any instance where the total number of duplication vertices is less
than the current amount, and that in these instances a disjoint/almost disjoint chain
packing can also be built with value equal to the minimum number of episodes.

Every duplication interval has a $\le_S$-upper endpoint. Find a $\le_S$-minimal one among
all $\le_S$-upper endpoints of duplication intervals. Let this vertex of $S$ be $P$.

Discrete model: Let $k \ge 1$ be the largest integer such that $P$ is the $\le_S$-upper endpoint
of each of the $k$ elements of some chain for the order $<_j$ in a gene tree $G_j$, say
$L_1 <_j L_2 <_j \dots <_j L_k$. Remove the duplication designation of any vertex $d$ in
any gene tree $G_i$ if $P$ belongs to the duplication interval of $d$ and no $<_i$-chain of
duplication vertices in $G_i$ with maximum element $d$ has length $k + 1$.

By induction, the same recursive algorithm solves the reduced episode problem 
optimally such that the min-max theorem holds for the reduced problem. Add P 
with multiplicity k to the system of episodes. We construct recursively a disjoint 
chain packing providing the same value, through the following two cases: 

(i) no chain in the optimal disjoint chain packing for the reduced problem covers
$P$. Add the chain $\{L_1 <_j L_2 <_j \dots <_j L_k\}$ to the disjoint chain packing for the
reduced problem; note that we still have a disjoint chain packing. We have the
following chain of inequalities: the minimum number of episodes in the original problem
is at most the minimum number of episodes in the reduced problem $+\,k$, which equals
the maximum value of a disjoint chain packing in the reduced problem $+\,k$, which
is at most the maximum value of a disjoint chain packing in the original problem.
We already know the trivial inequality for the min-max theorem, hence in this case
our algorithm provides the same number of episodes as the value of a disjoint chain
packing.

(ii) a chain $C$ in the optimal disjoint chain packing for the reduced problem covers
$P$. Let the lowest element in the chain $C$ correspond to the duplication interval $U$.
By the choice of $P$, $P$ must be in $U$. The duplication vertex $d$, which is responsible
for $U$, has not been deleted from the list of duplication vertices. This means that
$d$ is the maximum element in a $(k + 1)$-chain $C'$ of duplication vertices in its gene
tree. Merge $C$ and $C'$ into a single chain (this is possible as $d$ was the lowest element
of $C$ but the highest in $C'$), and put the merged chain in place of $C$ in the optimal
disjoint chain packing for the reduced problem to obtain a disjoint chain packing for
the original problem. The number of episodes that we use for the original problem
equals the value of the disjoint chain packing that we constructed for the original
problem.

In both cases, we constructed a disjoint chain packing, whose value is the same 
as the number of episodes constructed, and hence the induction proof is complete. 
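
The reduction step for the discrete model can be iterated as in the sketch below.
It is a hedged illustration under several assumptions of ours: trees are
child-to-parent dictionaries, all gene trees are stored in one dictionary with
distinct vertex names, intervals are `(upper, lower)` pairs satisfying (2.1), and
the names `greedy_discrete` and `chain_heights` are hypothetical. The text only
says that $P$ is added with multiplicity $k$; assigning a removed vertex to the
copy numbered by the length of the longest chain below it is our choice of how to
spread the removed vertices over those $k$ copies.

```python
def ancestors(parent, v):
    """Yield v, its parent, and so on up to the root."""
    while v is not None:
        yield v
        v = parent[v]


def chain_heights(gene_parent, dup):
    """For each d in dup: length of a longest chain in dup with maximum element d."""
    below = {d: [e for e in dup if e != d and d in ancestors(gene_parent, e)]
             for d in dup}
    memo = {}

    def height(d):
        if d not in memo:
            memo[d] = 1 + max((height(e) for e in below[d]), default=0)
        return memo[d]

    return {d: height(d) for d in dup}


def greedy_discrete(species_parent, gene_parent, intervals):
    """Return f: duplication vertex -> (species vertex, copy number)."""
    dup = dict(intervals)  # remaining duplication vertices -> (upper, lower)
    f = {}
    while dup:
        uppers = {upper for upper, _ in dup.values()}
        # P: an upper endpoint with no other upper endpoint strictly below it.
        P = next(u for u in uppers
                 if not any(w != u and u in ancestors(species_parent, w)
                            for w in uppers))
        # k: longest chain all of whose intervals have upper endpoint P.
        at_P = [d for d, (upper, _) in dup.items() if upper == P]
        k = max(chain_heights(gene_parent, at_P).values())
        height = chain_heights(gene_parent, dup)
        before = len(dup)
        for d, (upper, lower) in list(dup.items()):
            contains_P = (P in ancestors(species_parent, lower)
                          and upper in ancestors(species_parent, P))
            if contains_P and height[d] <= k:
                f[d] = (P, height[d])  # one of the k copies of P
                del dup[d]
        if len(dup) == before:  # cannot happen when (2.1) holds
            raise ValueError("no vertex removed; is the input monotone?")
    return f


species_parent = {"R": None, "A": "R", "x": "A", "y": "A", "z": "R"}
gene_parent = {"g1": None, "g2": "g1", "g3": "g2", "h1": None}
intervals = {"g1": ("R", "A"), "g2": ("A", "A"), "g3": ("A", "A"), "h1": ("A", "A")}
f = greedy_discrete(species_parent, gene_parent, intervals)
print(len(set(f.values())))  # 3
```

In this example the chain `g3 < g2 < g1` alone forces three episodes, so the
three episodes found by the sketch (two copies of `A` and one copy of `R`) are
optimal.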

Continuous model: Note that $P$ is not a leaf vertex in $S$, as duplication intervals
in the continuous model are non-degenerate. Assume that $e_1, e_2, \dots, e_\ell$ are the edges
leaving $P$ in directions different from $\infty$ in $S$. For $j = 1, 2, \dots, \ell$, let $H_j$ denote a
longest chain of duplication intervals over all gene trees with the following properties:

($\alpha$) the upper endpoint of every duplication interval from the chain is $P$;
($\beta$) every duplication interval of the chain uses the edge $e_j$.

Set $d_j = |H_j|$. As we will not need the $d_j = 0$ terms, assume that only those edges
leaving $P$ are enumerated on which $d_j \ge 1$, and for convenience those edges are still
labelled as $1, 2, \dots, \ell$.

We create a reduced problem by removing the duplication vertex designation of
certain vertices in the gene trees. Assume that $I_f$ is a duplication interval containing
$P$, with vertex $f \in D_i$ from the gene tree $G_i$. We remove the duplication designation
of $f$ and the duplication interval $I_f$ if

(a) $P$ is the upper endpoint of $I_f$, or

(b) $P$ is a vertex of $I_f$ but not an endvertex (so $I_f$ uses some edge $e_j$ from $P$) and
in $G_i$, there is no chain of duplication vertices of length exceeding $d_j$ in which $I_f$ is
the top element, or

(c) $P$ is the lower endpoint of $I_f$, and for every $j = 1, 2, \dots, \ell$, in $G_i$ there is no chain
of duplication vertices of length exceeding $d_j$ in which $I_f$ is the top element and all
other elements have duplication intervals passing through $e_j$.

By mathematical induction, the same recursive algorithm solves the reduced
episode problem optimally such that the min-max theorem holds for the reduced
problem. Add to the list of episode locations the following points in $S$: $P$ itself and
$d_j - 1$ distinct points from the interior of $e_j$ for every $j = 1, 2, \dots, \ell$. We construct
recursively an almost disjoint chain packing providing the same value, through the
following two cases:

(i) no chain in the optimal almost disjoint chain packing for the reduced problem
covers $P$. Add to this almost disjoint system of chains, which provides the
min-max result for the reduced episode problem by hypothesis, a length $d_j$ chain
for every $j = 1, 2, \dots, \ell$ from a gene tree, such that every duplication interval of this
length $d_j$ chain has upper endpoint $P$ and uses the edge $e_j$. It is easy to see that
we obtain an almost disjoint chain packing for the original problem. A simple
calculation shows, analogously to the discrete case, that from the min-max result for
the reduced problem, we obtain that in the original problem the number of episodes
equals the value of the almost disjoint chain packing, as both sides increase by

$$1 + \sum_{j=1}^{\ell} (d_j - 1) = \Bigl(\sum_{j=1}^{\ell} d_j\Bigr) - (\ell - 1).$$

The alternative of (i) is that one or more chains in the optimal almost disjoint
chain packing for the reduced problem cover $P$. Observe that $P$ may belong to two
chains only if $P$ is the upper endpoint of some elements of one of the chains. However,
rule (a) has removed those elements. We are left with:

(ii) a single chain $C$ (corresponding to some gene tree $G_i$) in the optimal almost
disjoint chain packing for the reduced problem covers $P$. Let the smallest element
in this chain be $U$, so $P \in U \in C$ by the choice of $P$.

The duplication vertex $d$ that is responsible for $U = U_d$ has not been deleted.
According to the removal rules, $P$ cannot be the upper endpoint of $U$. Therefore $P$
is an internal point or the lower endpoint of $U$.

If $P$ is an internal point of $U$ and $U$ passes through $e_j$, then there is a chain $C'$ of
duplication vertices in the gene tree corresponding to $U$, in which $U$ is the $(d_j + 1)$-th
element. Create the almost disjoint system of chains for the original problem from
the optimal almost disjoint system of chains for the reduced problem in the following
way: replace $C$ with $C \cup C'$, which is still a chain; and for $t = 1, 2, \dots, \ell$, $t \ne j$,
add the chain $H_t$. Indeed, we obtain an almost disjoint chain packing. A simple
calculation shows, analogously to the discrete case, that from the min-max result for
the reduced problem, we obtain that the number of episodes equals the value of
the almost disjoint chain packing in the original problem, as both sides increase by

$$1 + \sum_{j=1}^{\ell} (d_j - 1) = \Bigl(\sum_{j=1}^{\ell} d_j\Bigr) - (\ell - 1).$$

If $P$ is the lower endvertex of the duplication interval $U$ that comes from the gene
tree $G_i$, then for some $j$, there is a $(d_j + 1)$-chain $C'$ in $G_i$ with top element $U$,
and all other elements of this chain use $e_j$ in their duplication intervals. Create the
almost disjoint system of chains for the original problem from the optimal almost
disjoint system of chains for the reduced problem in the following way: replace $C$
with $C \cup C'$, which is still a chain; and for $t = 1, 2, \dots, \ell$, $t \ne j$, add the chain $H_t$.
Indeed, we obtain an almost disjoint chain packing. A simple calculation shows,
analogously to the discrete case, that from the min-max result for the reduced problem,
we obtain that the number of episodes equals the value of the almost disjoint
chain packing in the original problem, as both sides increase by $1 + \sum_{j=1}^{\ell} (d_j - 1)$.

Cases (i) and (ii) together will prove the correctness of the algorithm and the 
min-max theorem for the original problem. 

There is one more thing to check, namely that the episodes selected for the
reduced problem are distinct from the episodes selected at $P$ and on the down-edges
$e_j$ for $j = 1, 2, \dots, \ell$. The episodes selected at interior points of the $e_j$ edges
are no longer in any duplication interval in the reduced problem, and therefore they
cannot be selected there. In the reduced problem, the non-internal vertices selected
for episodes are upper endpoints of some duplication intervals, while $P$ is no longer
the upper endpoint of any duplication interval in the reduced problem. □

5 Relevance for bioinformatics 

Ohno was among the first to recognize the importance of gene and genome duplications
[13], and the resulting opportunity for evolutionary change afforded by genetic
redundancy. Gene duplication, and subsequent gene loss, are the primary drivers of
changes in gene content. Rates of duplication and loss also have been shown to vary
among lineages, and gene loss in particular is greatly elevated after whole-genome
duplication events, which have occurred many times in the evolution of the eukaryotes
[3], [11], [12]. (While gene content alone has been used for reconstruction of species
trees [15], it is very sensitive to parallel or convergent gains and losses, and has not
seen wide application.)

Gene trees may differ from the species tree and from each other because of re- 
peated gene duplication and gene loss. Gene loss may eliminate the gene from a 
species. A species may have more than one representation in the gene tree as a 
result of gene duplication. 

There are two possible explanations of finding duplicate genes: early genome
duplications and subsequent substantial gene loss, and occasional duplication of
small groups of consecutive genes, not requiring the assumption of substantial gene
loss. The latter event is called a duplication episode. Clearly both mechanisms are
present; an early vertebrate tetraploidization seems generally accepted. (Duplication
of "medium length" segments seems unlikely.)

To reconstruct the likely history of the gene content, we should know the cost 
associated with genome duplication, gene loss, and duplication episodes. We do not 
know those costs. Minimization problems for duplication episodes look for a most 
parsimonious explanation. 

The bioinformatics literature identifies duplication vertices in the gene trees. For
every duplication vertex, the LCA (least common ancestor) mapping designates a
vertex in the species tree, which is a lower bound in the $\le_S$ partial order for the point
in the species tree $S$ where the duplication of the gene could have happened. There
is no absolute upper bound on this gene duplication, but the more branchings follow
the gene duplication in the species tree, the more gene losses should be assumed.
Therefore a common parsimony approach is to allow the shortest duplication interval
for this gene: the edge between the lower bound vertex and its parent in the species
tree. The assumption (2.1) follows from the way of assigning duplication intervals
to duplication points in [1] and earlier in the literature.
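
The LCA mapping and the shortest duplication interval described above can be
sketched as follows. The sketch uses the usual definition of the LCA mapping (a
leaf goes to the species-tree leaf of its taxon, an internal vertex to the least
common ancestor of the images of its children); the encoding of the gene tree as
a parent-to-children dictionary with a separate leaf-to-taxon dictionary, and all
function names, are our assumptions. How duplication vertices are identified is
left to the literature and is not part of the sketch.

```python
def path_to_root(parent, v):
    """List v, its parent, and so on up to the root."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def lca(parent, u, v):
    """Least common ancestor of u and v in a rooted tree."""
    above_u = set(path_to_root(parent, u))
    for w in path_to_root(parent, v):
        if w in above_u:
            return w
    raise ValueError("vertices are not in the same tree")


def lca_mapping(species_parent, gene_children, leaf_taxon, g):
    """Species-tree vertex assigned to the gene-tree vertex g."""
    if g in leaf_taxon:  # a leaf maps to the species-tree leaf of its taxon
        return leaf_taxon[g]
    images = [lca_mapping(species_parent, gene_children, leaf_taxon, c)
              for c in gene_children[g]]
    result = images[0]
    for image in images[1:]:
        result = lca(species_parent, result, image)
    return result


def shortest_interval(species_parent, gene_children, leaf_taxon, g):
    """(upper, lower): the edge from the LCA image of g to its parent.

    If the image is the root, upper is None, standing for the point at infinity.
    """
    lower = lca_mapping(species_parent, gene_children, leaf_taxon, g)
    return (species_parent[lower], lower)


species_parent = {"R": None, "A": "R", "x": "A", "y": "A", "z": "R"}
gene_children = {"g1": ["g2", "l3"], "g2": ["l1", "l2"]}
leaf_taxon = {"l1": "x", "l2": "x", "l3": "y"}
print(shortest_interval(species_parent, gene_children, leaf_taxon, "g2"))  # ('A', 'x')
print(shortest_interval(species_parent, gene_children, leaf_taxon, "g1"))  # ('R', 'A')
```

In this example the intervals of `g1` and `g2` satisfy the monotonicity
assumption: both endpoints of the interval of `g1` lie above the corresponding
endpoints of the interval of `g2`.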

In the theorems above we have assumed that all duplication intervals are closed 
and the upper and lower endpoints of the intervals are vertices of the species tree S. 
Let us first consider the continuous model. Without loss of generality we may assume 
that the duplication intervals end at vertices of S by simply subdividing edges of 
S if necessary, as we did not assume that S was binary. As to the assumption of 
the intervals being closed: the proof really only used that the intervals are closed
upwards, i.e. they include their upper endpoint in the $\ge_S$ ordering.

It is easy to see that the discrete model can be viewed as follows: using $k$ copies
of a vertex $v$ is equivalent to placing $k$ duplication episodes on the edge leading
from $v$ to its parent (not using the parent). Therefore in effect the discrete model is
equivalent to assuming that all duplication intervals are of the form $[x, y)$,
where $y$ is an ancestor of $x$. Thus, our theorem for the discrete model follows from
the theorem on the continuous model.

Bansal and Eulenstein [1] solved the episode minimization problem for the discrete
model. We leave the decision for the biology literature when the discrete or
the continuous model is to be used. We provided duality results for both models.
Suranyi's Theorem 1.2 can be understood as a min-max result for the so-called
episode clustering problem: we are given duplication intervals and we want to cover
the duplication intervals with the minimum number of points, while we do not require
that the duplication episodes follow in strict order.